Xiang Fang

Turing Patterns for Multimedia: Reaction-Diffusion Multi-Modal Fusion for Language-Guided Video Moment Retrieval

Jun 01, 2026
Via arXiv

Hierarchical Semantic-Augmented Navigation: Optimal Transport and Graph-Driven Reasoning for Vision-Language Navigation

Jun 01, 2026
Via arXiv

SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling

May 29, 2026
Via arXiv

Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness

May 29, 2026
Via arXiv

Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding

May 29, 2026
Via arXiv

CogniVerse: Revolutionizing Multi-Modal Retrieval-Augmented Generation with Cognitive Reflection and Geometric Reasoning

May 28, 2026
Via arXiv

Not All Inputs Are Valid: Towards Open-Set Video Moment Retrieval Using Language

May 28, 2026
Via arXiv

Fewer Steps, Better Performance: Efficient Cross-Modal Clip Trimming for Video Moment Retrieval Using Language

May 28, 2026
Via arXiv

Rethinking Video-Language Model from the Language Input Perspective

May 27, 2026
Via arXiv

Towards Unified Vision-Language Models with Incomplete Multi-Modal Inputs

May 27, 2026
Via arXiv